A comparison of personal name matching: Techniques and practical issues
Finding and matching personal names is at the core of an increasing number of applications: from text and Web mining, information retrieval and extraction, search engines, to deduplication and data linkage systems. Variations and errors in names make exact string matching problematic, and approximate matching techniques based on phonetic encoding or pattern matching have to be applied. When compared to general text, however, personal names have different characteristics that need to be considered.
In this paper we discuss the characteristics of personal names and present potential sources of variations and errors. We survey a broad range of commonly used name matching techniques, as well as some recently developed ones. Experimental comparisons on four large name data sets indicate that there is no clear best technique. We provide a series of recommendations that will help researchers and practitioners select a name matching technique suitable for a given data set.
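To illustrate the two families of techniques named in this abstract, the following minimal Python sketch pairs a phonetic encoding (American Soundex) with a pattern-matching measure (Levenshtein edit distance). Both are standard textbook techniques chosen here as assumptions; the abstract does not say which specific techniques the paper compares, and the function names are hypothetical.

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits (illustrative sketch)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    encoded = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            encoded += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (encoded + "000")[:4]  # pad or truncate to four characters


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


print(soundex("Robert"), soundex("Rupert"))  # both encode to R163
print(edit_distance("smith", "smyth"))       # one substitution
```

The phonetic code groups names that sound alike regardless of spelling, while the edit distance tolerates typographical errors; in practice the two are often combined or compared, which is the kind of trade-off the abstract alludes to.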
Towards Parameter-free Blocking for Scalable Record Linkage
Linking or matching databases is becoming increasingly important in
many data mining projects, as linked data can contain information that
is not available otherwise, or that would be too expensive to collect.
A main challenge when linking large databases is the complexity of the
linkage process: potentially each record in one database has to be
compared with all records in the other database. Various techniques,
collectively known as `blocking', have been developed to deal with this
quadratic complexity. Most of these techniques require several
parameters to be set by the user in order to achieve good results. In
this paper we evaluate six blocking techniques within a common
framework with regard to the number and quality of the candidate
record pairs generated. We propose a modification to two existing
techniques that reduces the variance in the quality of the blocking
results over a range of parameter values, enabling more robust,
practical record linkage without the need for time-consuming manual
parameter tuning.
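The idea of blocking can be sketched in a few lines of Python. This is a minimal illustration of traditional key-based blocking only, not one of the six techniques evaluated in the paper nor its proposed modification; the record layout, the `blocking_key` function, and the choice of a three-letter surname prefix are all assumptions made for the example.

```python
from collections import defaultdict


def blocking_key(record: dict) -> str:
    # Hypothetical key: the first three letters of the surname, lowercased.
    return record["surname"][:3].lower()


def candidate_pairs(db_a, db_b, key=blocking_key):
    """Yield only the record pairs whose blocking keys are equal."""
    blocks = defaultdict(list)
    for rec in db_b:
        blocks[key(rec)].append(rec)
    for rec_a in db_a:
        for rec_b in blocks.get(key(rec_a), []):
            yield rec_a["id"], rec_b["id"]


db_a = [{"id": "a1", "surname": "Smith"},
        {"id": "a2", "surname": "Miller"},
        {"id": "a3", "surname": "Smyth"}]
db_b = [{"id": "b1", "surname": "Smith"},
        {"id": "b2", "surname": "Millar"},
        {"id": "b3", "surname": "Jones"}]

pairs = list(candidate_pairs(db_a, db_b))
full = len(db_a) * len(db_b)  # size of the full cross product
print(pairs)
print(f"{len(pairs)} of {full} comparisons kept")
```

Here only 2 of the 9 possible pairs are compared, but the likely match between "Smyth" and "Smith" is missed because their keys differ. This sensitivity to the key and its parameters is the kind of tuning burden the abstract describes.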
Noise-tolerant approximate blocking for dynamic real-time entity resolution
Entity resolution is the process of identifying records in one or multiple data sources that represent the same real-world entity. This process needs to deal with noisy data that contain, for example, pronunciation or spelling errors. Many real-world applications require rapid responses for entity queries on dynamic datasets. This brings challenges to existing approaches, which are mainly aimed at the batch matching of records in static data. Locality sensitive hashing (LSH) is an approximate blocking approach that hashes objects within a certain distance into the same block with high probability. How to make approximate blocking approaches scalable to large datasets and effective for entity resolution in real time remains an open question. Targeting this problem, we propose a noise-tolerant approximate blocking approach to index records based on their distance ranges using LSH and sorting trees within large hash blocks. Experiments conducted on both synthetic and real-world datasets show the effectiveness of the proposed approach.
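The LSH property described in this abstract can be illustrated with a minimal MinHash-with-banding sketch in Python. This shows plain LSH blocking only; it does not implement the distance ranges or sorting trees of the proposed approach. The q-gram size, the number of bands and rows, the hash construction, and all function names are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def shingles(text: str, q: int = 2) -> set:
    """Character q-grams of a string; short strings yield a single shingle."""
    text = text.lower()
    if len(text) < q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def _hash(seed: int, item: str) -> int:
    # Deterministic seeded hash built from BLAKE2b (an arbitrary choice).
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set, num_hashes: int = 20) -> tuple:
    """One minimum per seeded hash function; similar sets tend to agree."""
    return tuple(min(_hash(seed, s) for s in items)
                 for seed in range(num_hashes))


def lsh_blocks(records: dict, bands: int = 10, rows: int = 2) -> dict:
    """Map each (band index, band values) key to the ids of records sharing it."""
    blocks = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            blocks[(b, band)].add(rec_id)
    return blocks


records = {"r1": "christen", "r2": "christen", "r3": "kristen", "r4": "zhang"}
blocks = lsh_blocks(records)
shared = {frozenset(ids) for ids in blocks.values() if len(ids) > 1}
print(shared)  # groups of record ids that collide in at least one band
```

Identical strings always collide in every band, and strings with a high q-gram overlap collide in at least one band with high probability. Fewer rows per band makes collisions more likely and blocks larger, which is the scalability tension the abstract raises.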
Time-aware topic recommendation based on micro-blogs
Topic recommendation can help users deal with information overload in micro-blogging communities. This paper proposes to use the implicit information network formed by the multiple relationships among users, topics and micro-blogs, together with the temporal information of micro-blogs, to find semantically and temporally relevant topics for each topic and to profile users' time-drifting topic interests. Content-based, Nearest Neighborhood-based and Matrix Factorization models are used to make personalized recommendations. The effectiveness of the proposed approaches is demonstrated in experiments conducted on a real-world dataset collected from Twitter.com.
Context Aware Computing for The Internet of Things: A Survey
As we are moving towards the Internet of Things (IoT), the number of sensors
deployed around the world is growing at a rapid pace. Market research has shown
significant growth in sensor deployments over the past decade and has
predicted a significant increase in the growth rate in the future. These
sensors continuously generate enormous amounts of data. However, in order to
add value to raw sensor data we need to understand it. Collection, modelling,
reasoning, and distribution of context in relation to sensor data play a
critical role in this challenge. Context-aware computing has proven to be
successful in understanding sensor data. In this paper, we survey context
awareness from an IoT perspective. We present the necessary background by
introducing the IoT paradigm and context-aware fundamentals at the beginning.
Then we provide an in-depth analysis of the context life cycle. We evaluate a
subset of projects (50) which represent the majority of research and commercial
solutions proposed in the field of context-aware computing conducted over the
last decade (2001-2011) based on our own taxonomy. Finally, based on our
evaluation, we highlight the lessons to be learnt from the past and some
possible directions for future research. The survey addresses a broad range of
techniques, methods, models, functionalities, systems, applications, and
middleware solutions related to context awareness and IoT. Our goal is not only
to analyse, compare and consolidate past research work but also to appreciate
their findings and discuss their applicability towards the IoT.
Comment: IEEE Communications Surveys & Tutorials Journal, 201
Context-aware Dynamic Discovery and Configuration of 'Things' in Smart Environments
The Internet of Things (IoT) is a dynamic global information network
consisting of Internet-connected objects, such as RFIDs, sensors, actuators, as
well as other instruments and smart appliances that are becoming an integral
component of the future Internet. Currently, such Internet-connected objects or
`things' outnumber both people and computers connected to the Internet and
their population is expected to grow to 50 billion in the next 5 to 10 years.
To be able to develop IoT applications, such `things' must become dynamically
integrated into emerging information networks supported by architecturally
scalable and economically feasible Internet service delivery models, such as
cloud computing. Achieving such integration through discovery and configuration
of `things' is a challenging task. Towards this end, we propose a Context-Aware
Dynamic Discovery of Things (CADDOT) model. We have developed a tool,
SmartLink, that is capable of discovering sensors deployed in a particular
location despite their heterogeneity. SmartLink helps to establish the direct
communication between sensor hardware and cloud-based IoT middleware platforms.
We address the challenge of heterogeneity using a plug-in architecture. Our
prototype tool is developed on an Android platform. Further, we employ the
Global Sensor Network (GSN) as the IoT middleware for the proof of concept
validation. The significance of the proposed solution is validated using a
test-bed that comprises 52 Arduino-based Libelium sensors.
Comment: Big Data and Internet of Things: A Roadmap for Smart Environments,
Studies in Computational Intelligence book series, Springer Berlin
Heidelberg, 201